STUDENT EVALUATIONS RELATED TO FREQUENCY OF TESTING

By Steven L. Shapiro 
Barry A. Stein 



An interesting finding concerning student evaluations of teachers was
noted recently as a result of an experiment designed to determine the effects of
differences in frequency of testing on the performance of students enrolled
under two different admissions policies in an urban community college. The
two policies are open admissions, which admits all high school graduates
regardless of average, and selective admissions, which in N.Y.C. prior to Fall
1970 required most high school graduates to have a 75 per cent average or
higher to gain admission to a community college.

DESIGN 



The experimental sample was selected from those students registering
for Business Organization and Management for the Fall 1972 semester at
Queensborough Community College. The sample was distributed into twelve
classes, divided into three treatment groups: four receiving 10 tests, four
5 tests and four 3 tests during the term. Room and time assignments were made
at random. Each group was given the same 150 multiple-choice questions
during the semester to measure learning. Each class took the same 100-item
multiple-choice final examination.

Dr. Steven L. Shapiro and Dr. Barry A. Stein are assistant professors
in the department of business, Queensborough Community College, Bayside,
N.Y. 11364. The authors thank Professor Sheldon Somerstein, chairman of the
business department, for his outstanding cooperation throughout the experiment.



There were four instructors teaching the twelve classes in the
investigation; each taught a 10-test, a 5-test and a 3-test class (see Table 1).
The instructors met weekly with the experiment leader to discuss the topics to
be covered and methodology to be used.
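The class layout described above can be sketched in a few lines of Python. This is only an illustration of the crossed design (four instructors, each teaching one class under every testing frequency). The instructor labels and the even split of the 150 items across tests are assumptions made for the sketch; the text gives neither.

```python
from itertools import product

# Hypothetical labels: the paper does not name the four instructors.
INSTRUCTORS = ["A", "B", "C", "D"]
TESTS_PER_TERM = [10, 5, 3]  # the three treatment groups in the text
TOTAL_ITEMS = 150            # multiple-choice items given to every group


def build_design():
    """Return one record per class: each instructor teaches every treatment once."""
    return [
        {
            "instructor": instructor,
            "tests": tests,
            # Assumption: items are divided evenly among the tests.
            "items_per_test": TOTAL_ITEMS // tests,
        }
        for instructor, tests in product(INSTRUCTORS, TESTS_PER_TERM)
    ]


design = build_design()
classes_per_group = {
    tests: sum(1 for c in design if c["tests"] == tests) for tests in TESTS_PER_TERM
}
```

Under these assumptions the sketch yields twelve classes, four per treatment group, with every group seeing the same 150 items in total.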



Although all three treatment groups were composed of students from
two different high school academic levels (below 75% and 75% or above), the
groups were proven to be comparable by a two-way analysis of variance on
the variables high school average and reading and English expression scores.



Queensborough Community College is a branch of the City University
of New York.



RESULTS AND CONCLUSIONS 



Through analysis of the data, it was noted that the students taking
more than three tests (experimental groups) had significantly higher (at the
.01 level) final examination scores and final course grades than those given
three tests during the semester (control group). To be precise, open admissions
freshmen did best when tested ten times during the term, while regular freshmen
achieved most when tested five times throughout the semester (see Table 2).



It was further observed that the students in the experimental groups
rated their instructors higher in all categories of the student evaluation form
used throughout the college. Although the nature of the evaluation instrument
precludes identification of individual students, the overall findings in
Table 3 indicate surface validity and again raise a question that has been
pondered for many years: What really is the relationship between students'
achievement and teacher ratings?



In this experiment, the objective test results, which show increased
learning by students taking five or ten tests as opposed to three, agree with
the subjective student evaluations of teachers. Although differences in subject
matter, teachers and methodology affected the evaluations, all three groups
were affected, owing to the selective manipulation used in setting up the
investigation. It is assumed that the findings were produced by the independent
variable, frequency of testing, and not by differences in subject matter,
teachers or methodology.

The findings of this study support D.N. Elliot who, in a study of a large
introductory chemistry course, concluded that "... there is probably, in
general, a positive relationship between the ratings given an instructor by his
students and their achievement ..."² They are also compatible with H.H.
Remmers who, in essentially the same experimental design, concluded that
"... there is warrant for ascribing validity to student ratings ... as
measured by what students actually learn of the content of the course."³

² D.N. Elliot, Purdue University Student Higher Education, 70, 5 (1950).

³ H.H. Remmers, F.D. Martin, D.N. Elliot, Purdue University Student
Higher Education, 66, 17 (1949).

Some investigators have found a negative correlation between the
amount learned from an instructor and the students' evaluation of his teaching
performance. Rodin and Rodin, in a study of 293 students in an undergraduate
calculus course, concluded that "... the instructors with the three lowest
subjective scores received the three highest objective scores while the
instructor with the highest subjective rating was lowest on the objective
measure."⁴

R.H. Knapp found evidence that student evaluations, to a large extent,
tend to reflect the personal and social qualities of an instructor, "who he is"
rather than "what he does."⁵ The results of this investigation indicate that
"what he does" and not "who he is" determines to some degree the results of
the student evaluation. Testing frequency seems to have been measured
rather than individual teaching abilities. Each of the four participating
instructors received their best evaluations as a result of increasing exam
frequency. Collectively, they did not receive their highest rating in any of
the ten categories from the 3-test group. The results indicate that students
taking 10 tests rate the instructors highest in categories 3, 4, 6, 7, 8, 9, and
10. Students in the 5-test group evaluate the instructors best in categories 1,
2 and 5. While it must be noted that the teachers themselves may have placed
more emphasis on the experimental classes due to the nature of the study, these
findings do show a pattern which indicates the importance of course modification
in affecting student evaluations of teachers.

⁴ M. Rodin and B. Rodin, Science 177, 4055 (1972).

⁵ R.H. Knapp, The American College, N. Sanford, Ed. (Wiley, New
York, 1962), pp. 290-311.

VARIATIONS IN RATINGS 
The student evaluation of faculty is used today for purposes of
rehiring, promotions and tenure. It can be an important determinant in the
relative success or ultimate failure of a teacher's career. In examining the
individual categories more closely, it appears that students are measuring their
image of achievement rather than teacher performance. By scanning column A
we see that although all four teachers were required to cover the same topics
during the term and all students received the same 150 multiple-choice items,
the differences in teacher ratings are quite evident. Overall mean final
examination scores differ significantly by approximately three points between
the 10-test and 3-test groups (see Table 2), while student evaluations differ by
as much as 36.9 per cent (category 7) between the same two groups.



Interestingly, the greatest variation occurs in the category "Evaluates
students' work." Students in the 10-test group, who were constantly evaluated,
indicated this on the rating form. While this was true, these students took the
same number of test items during the semester as those evaluated less
frequently, and the mean total items correct during the term was 100.89 for the
10-test group and 101.4 for the 3-test group.



The category "How would you describe instructor to others?" shows
the second greatest variation of ratings (35.9 per cent). This category is a
particularly important one. It shows that when the four teachers in the
experiment gave three tests during the semester, 24 per cent of the students
rated them excellent. When five tests were used, 53 per cent responded
excellent, and when 10 tests were given, 61 per cent described the instructors
as excellent. Accepting possible variations in teacher motivation toward the
various groups, the percentages still strongly favor the instructors when they
used increased test frequency.



Although the category "Rate your own performance" shows a relatively
small variation of ratings (10.2 per cent), it is important to recognize that
students have a better self-concept when undergoing higher frequency testing.
It is even more evident when columns A and B are combined and the variation
increases to 25.7 per cent.



The student evaluations of instructors were conducted prior to the final
examination. At that time, students were not aware of final exam grades or
final course grades. The only evaluations of students were in the form of exam
grades. The mean total items correct from these exams for the three groups had
a variation of .509 (the difference in mean total items correct between the
10-test group and 3-test group), which is remarkably small when considering the
number of students involved and the number of test items administered. The 258
students responding (81.6% of the 316 finishing the semester in the twelve
classes) should have rated the instructors practically the same in each
frequency group since achievement had been virtually the same up to that point.






Since final examination mean scores show significant differences between 
the groups, higher ratings as exam frequency increases indicate a relationship 
between student learning (achievement) and teacher ratings. This positive 
relationship, however, goes further than merely stating a possible correlation. 
The real question becomes: Are teacher ratings subject to actual student 
achievement which may be created by one or more course variables?

"THE GREATER LEARNING IMAGE"
In this study, the conclusions of Elliot and Remmers are substantiated
while those of Rodin and Rodin and Knapp are not. There does seem to be a
relationship between achievement and ratings but certainly not a simple one.
The ratings seem to have been made, to a great extent, according to the
students' perception of learning throughout the semester. This "greater learning
image" may be the result of several factors. Perhaps increased test frequency
as opposed to three major examinations reduced test anxiety and made no one
test critical. Another reason may have been the personal contact between
teacher and student which developed as a result of constant item discussion
and grade distribution throughout the term. A third possibility could have
been that as a routine of testing was established, students liked having
fewer topics on each exam, and found studying to be easier. All or any of
these factors may have had a much greater effect on the student evaluations
than the actual increased achievement which is evidenced by the final
examination mean scores.

SUMMARY 

The variable testing frequency seems to have a great influence on
teacher evaluations. Results show tremendous increases in ratings as test
frequency increases. "Who the teacher is" as opposed to "what he does"
seems to be unimportant. What is important is how student achievement and
evaluation of faculty are affected by frequency of testing. There is a
significant relationship between test frequency and student achievement on the
final examination. From this standpoint, the appraisal instrument used to
evaluate the teachers is valid to some extent. There appears to be a striking
relationship between the "greater learning image" created in this investigation
by increased test frequency and student evaluation of instructors.



With the student evaluation becoming a more and more important part
of the success or failure of the college teacher, there is no doubt that a great
many questions concerning these evaluations are still only vaguely answered.
The researcher must no longer be concerned with only teacher effectiveness
but also concentrate on the ingredients that contribute to the overall
effectiveness of instruction.



The instructor, on the other hand, anxious for a high rating in category
9 as well as all the others, must begin to seek out various means of stimulating
the ratings. In the sense that this may lead to experimentation and educational
advances, this is fine. If, however, this pursuit leads to the use of teaching
gimmicks, designed only to improve image and not instruction, the teacher
evaluation idea fails. Those using the evaluations must read them carefully
and always remember that while numbers don't lie, they do sometimes exaggerate.